1 Introduction

1.1 What is FastQC?

  • Modern high throughput sequencers can generate hundreds of millions of sequences in a single run. Before analysing this sequence to draw biological conclusions you should always perform some simple quality control checks to ensure that the raw data looks good and there are no problems or biases in your data which may affect how you can usefully use it.

  • Most sequencers will generate a QC report as part of their analysis pipeline, but this is usually only focused on identifying problems which were generated by the sequencer itself. FastQC aims to provide a QC report which can spot problems which originate either in the sequencer or in the starting library material.

  • FastQC can be run in one of two modes. It can either run as a stand alone interactive application for the immediate analysis of small numbers of FastQ files, or it can be run in a non-interactive mode where it would be suitable for integrating into a larger analysis pipeline for the systematic processing of large numbers of files.

1.2 What is Cutadapt?

  • Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.

  • Cutadapt helps with trimming tasks by finding the adapter or primer sequences in an error-tolerant way. It can also modify and filter single-end and paired-end reads in various ways. Adapter sequences can contain IUPAC wildcard characters. Cutadapt can also demultiplex your reads.

3 Tutorial

In this tutorial we are going to learn how to use FastQC and Cutadapt to quality control and trim adapter sequences from FASTQ files. The first thing to do is create a directory to store all the tutorial data. It is good practice to create a new directory for each project you work on, this ensures files do not get mixed up and all the results are self-contained.

Create a ‘tutorial’ directory to store output files:

bash
mkdir tutorial

Download the tutorial and exercise data:

bash
curl https://media.githubusercontent.com/media/zifornd/bioinformatics-workshop/main/data/quality/data.tar.gz --output tutorial/data.tar.gz
##   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
##                                  Dload  Upload   Total   Spent    Left  Speed
## 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:03 --:--:--     0
  0 61.4M    0 13851    0     0   3834      0  4:40:07  0:00:03  4:40:04  3838
  0 61.4M    0  607k    0     0   133k      0  0:07:51  0:00:04  0:07:47  133k
  1 61.4M    1  751k    0     0   135k      0  0:07:45  0:00:05  0:07:40  177k
  1 61.4M    1  783k    0     0   118k      0  0:08:49  0:00:06  0:08:43  183k
  1 61.4M    1  863k    0     0   114k      0  0:09:10  0:00:07  0:09:03  204k
  1 61.4M    1  879k    0     0   102k      0  0:10:13  0:00:08  0:10:05  174k
  1 61.4M    1  927k    0     0  99353      0  0:10:48  0:00:09  0:10:39 65392
  1 61.4M    1  943k    0     0  91130      0  0:11:47  0:00:10  0:11:37 39009
  1 61.4M    1  991k    0     0  87684      0  0:12:14  0:00:11  0:12:03 42647
  1 61.4M    1 1055k    0     0  85399      0  0:12:34  0:00:12  0:12:22 38565
  1 61.4M    1 1103k    0     0  83304      0  0:12:53  0:00:13  0:12:40 45943
  2 61.4M    2 1280k    0     0  89725      0  0:11:58  0:00:14  0:11:44 71478
  2 61.4M    2 1311k    0     0  86350      0  0:12:26  0:00:15  0:12:11 76098
  2 61.4M    2 1359k    0     0  83693      0  0:12:49  0:00:16  0:12:33 74548
  2 61.4M    2 1408k    0     0  81933      0  0:13:06  0:00:17  0:12:49 73027
  2 61.4M    2 1487k    0     0  82079      0  0:13:05  0:00:18  0:12:47 78737
  2 61.4M    2 1695k    0     0  88818      0  0:12:05  0:00:19  0:11:46 86123
  2 61.4M    2 1839k    0     0  91385      0  0:11:45  0:00:20  0:11:25  104k
  3 61.4M    3 1935k    0     0  91860      0  0:11:41  0:00:21  0:11:20  116k
  3 61.4M    3 1999k    0     0  90503      0  0:11:52  0:00:22  0:11:30  117k
  3 61.4M    3 2048k    0     0  88905      0  0:12:04  0:00:23  0:11:41  111k
  3 61.4M    3 2143k    0     0  89399      0  0:12:00  0:00:24  0:11:36 91677
  3 61.4M    3 2399k    0     0  95894      0  0:11:11  0:00:25  0:10:46  111k
  3 61.4M    3 2463k    0     0  94919      0  0:11:18  0:00:26  0:10:52  105k
  4 61.4M    4 2575k    0     0  95695      0  0:11:13  0:00:27  0:10:46  116k
  4 61.4M    4 2688k    0     0  96391      0  0:11:08  0:00:28  0:10:40  128k
  4 61.4M    4 2783k    0     0  96384      0  0:11:08  0:00:29  0:10:39  127k
  4 61.4M    4 2816k    0     0  94339      0  0:11:23  0:00:30  0:10:53 86252
  4 61.4M    4 2880k    0     0  93401      0  0:11:29  0:00:31  0:10:58 85319
  4 61.4M    4 3008k    0     0  94598      0  0:11:21  0:00:32  0:10:49 88528
  5 61.4M    5 3200k    0     0  97628      0  0:11:00  0:00:33  0:10:27  102k
  5 61.4M    5 3328k    0     0  98404      0  0:10:54  0:00:34  0:10:20  107k
  5 61.4M    5 3407k    0     0  98141      0  0:10:56  0:00:35  0:10:21  118k
  5 61.4M    5 3487k    0     0  97676      0  0:10:59  0:00:36  0:10:23  121k
  5 61.4M    5 3567k    0     0  97288      0  0:11:02  0:00:37  0:10:25  112k
  5 61.4M    5 3631k    0     0  96456      0  0:11:08  0:00:38  0:10:30 88574
  5 61.4M    5 3679k    0     0  95216      0  0:11:16  0:00:39  0:10:37 72887
  5 61.4M    5 3712k    0     0  93734      0  0:11:27  0:00:40  0:10:47 62335
  5 61.4M    5 3743k    0     0  92230      0  0:11:38  0:00:41  0:10:57 52418
  6 61.4M    6 3776k    0     0  90862      0  0:11:49  0:00:42  0:11:07 42603
  6 61.4M    6 3919k    0     0  92141      0  0:11:39  0:00:43  0:10:56 58915
  6 61.4M    6 4079k    0     0  93775      0  0:11:27  0:00:44  0:10:43 82296
  6 61.4M    6 4367k    0     0  98187      0  0:10:56  0:00:45  0:10:11  131k
  7 61.4M    7 4815k    0     0   103k      0  0:10:08  0:00:46  0:09:22  213k
  7 61.4M    7 4975k    0     0   104k      0  0:10:01  0:00:47  0:09:14  240k
  8 61.4M    8 5167k    0     0   106k      0  0:09:51  0:00:48  0:09:03  249k
  8 61.4M    8 5327k    0     0   107k      0  0:09:45  0:00:49  0:08:56  249k
  8 61.4M    8 5504k    0     0   108k      0  0:09:38  0:00:50  0:08:48  226k
  9 61.4M    9 5727k    0     0   111k      0  0:09:26  0:00:51  0:08:35  183k
  9 61.4M    9 5935k    0     0   112k      0  0:09:17  0:00:52  0:08:25  191k
  9 61.4M    9 6127k    0     0   114k      0  0:09:10  0:00:53  0:08:17  192k
 10 61.4M   10 6319k    0     0   115k      0  0:09:03  0:00:54  0:08:09  198k
 10 61.4M   10 6735k    0     0   121k      0  0:08:39  0:00:55  0:07:44  246k
 11 61.4M   11 7055k    0     0   124k      0  0:08:24  0:00:56  0:07:28  265k
 11 61.4M   11 7311k    0     0   127k      0  0:08:15  0:00:57  0:07:18  275k
 12 61.4M   12 7552k    0     0   128k      0  0:08:07  0:00:58  0:07:09  284k
 12 61.4M   12 7759k    0     0   130k      0  0:08:03  0:00:59  0:07:04  286k
 12 61.4M   12 7872k    0     0   129k      0  0:08:04  0:01:00  0:07:04  225k
 12 61.4M   12 7936k    0     0   128k      0  0:08:08  0:01:01  0:07:07  173k
 12 61.4M   12 8047k    0     0   128k      0  0:08:09  0:01:02  0:07:07  146k
 13 61.4M   13 8223k    0     0   129k      0  0:08:06  0:01:03  0:07:03  133k
 13 61.4M   13 8448k    0     0   130k      0  0:08:00  0:01:04  0:06:56  138k
 14 61.4M   14 8943k    0     0   136k      0  0:07:41  0:01:05  0:06:36  216k
 15 61.4M   15 9503k    0     0   142k      0  0:07:20  0:01:06  0:06:14  317k
 16 61.4M   16  9.8M    0     0   149k      0  0:06:59  0:01:07  0:05:52  416k
 16 61.4M   16 10.1M    0     0   151k      0  0:06:56  0:01:08  0:05:48  430k
 16 61.4M   16 10.3M    0     0   152k      0  0:06:52  0:01:09  0:05:43  431k
 17 61.4M   17 10.5M    0     0   152k      0  0:06:52  0:01:10  0:05:42  364k
 17 61.4M   17 10.6M    0     0   152k      0  0:06:53  0:01:11  0:05:42  278k
 17 61.4M   17 10.7M    0     0   151k      0  0:06:54  0:01:12  0:05:42  178k
 17 61.4M   17 10.9M    0     0   152k      0  0:06:53  0:01:13  0:05:40  166k
 18 61.4M   18 11.1M    0     0   152k      0  0:06:51  0:01:14  0:05:37  158k
 18 61.4M   18 11.3M    0     0   154k      0  0:06:47  0:01:15  0:05:32  179k
 19 61.4M   19 11.7M    0     0   156k      0  0:06:40  0:01:16  0:05:24  223k
 20 61.4M   20 12.4M    0     0   164k      0  0:06:22  0:01:17  0:05:05  349k
 21 61.4M   21 12.9M    0     0   168k      0  0:06:13  0:01:18  0:04:55  408k
 21 61.4M   21 13.2M    0     0   170k      0  0:06:09  0:01:19  0:04:50  430k
 21 61.4M   21 13.3M    0     0   169k      0  0:06:11  0:01:20  0:04:51  393k
 21 61.4M   21 13.3M    0     0   168k      0  0:06:14  0:01:21  0:04:53  339k
 22 61.4M   22 13.5M    0     0   168k      0  0:06:14  0:01:22  0:04:52  220k
 22 61.4M   22 13.7M    0     0   168k      0  0:06:12  0:01:23  0:04:49  176k
 22 61.4M   22 14.0M    0     0   169k      0  0:06:10  0:01:24  0:04:46  162k
 23 61.4M   23 14.2M    0     0   170k      0  0:06:08  0:01:25  0:04:43  195k
 23 61.4M   23 14.7M    0     0   174k      0  0:06:01  0:01:26  0:04:35  272k
 24 61.4M   24 15.2M    0     0   177k      0  0:05:53  0:01:27  0:04:26  342k
 25 61.4M   25 15.7M    0     0   182k      0  0:05:44  0:01:28  0:04:16  408k
 26 61.4M   26 16.1M    0     0   184k      0  0:05:41  0:01:29  0:04:12  428k
 26 61.4M   26 16.4M    0     0   185k      0  0:05:38  0:01:30  0:04:08  444k
 27 61.4M   27 16.6M    0     0   186k      0  0:05:37  0:01:31  0:04:06  399k
 27 61.4M   27 16.8M    0     0   186k      0  0:05:37  0:01:32  0:04:05  335k
 27 61.4M   27 17.1M    0     0   187k      0  0:05:35  0:01:33  0:04:02  281k
 28 61.4M   28 17.3M    0     0   188k      0  0:05:34  0:01:34  0:04:00  262k
 28 61.4M   28 17.7M    0     0   189k      0  0:05:31  0:01:35  0:03:56  259k
 29 61.4M   29 18.2M    0     0   193k      0  0:05:25  0:01:36  0:03:49  317k
 30 61.4M   30 18.7M    0     0   197k      0  0:05:18  0:01:37  0:03:41  397k
 30 61.4M   30 18.9M    0     0   196k      0  0:05:19  0:01:38  0:03:41  367k
 31 61.4M   31 19.0M    0     0   196k      0  0:05:20  0:01:39  0:03:41  342k
 31 61.4M   31 19.0M    0     0   194k      0  0:05:23  0:01:40  0:03:43  284k
 31 61.4M   31 19.1M    0     0   193k      0  0:05:25  0:01:41  0:03:44  191k
 31 61.4M   31 19.1M    0     0   191k      0  0:05:28  0:01:42  0:03:46 81664
 31 61.4M   31 19.2M    0     0   190k      0  0:05:30  0:01:43  0:03:47 58990
 31 61.4M   31 19.4M    0     0   190k      0  0:05:30  0:01:44  0:03:46 84634
 31 61.4M   31 19.5M    0     0   189k      0  0:05:31  0:01:45  0:03:46   99k
 32 61.4M   32 19.8M    0     0   190k      0  0:05:30  0:01:46  0:03:44  134k
 32 61.4M   32 19.9M    0     0   190k      0  0:05:31  0:01:47  0:03:44  159k
 32 61.4M   32 20.0M    0     0   189k      0  0:05:32  0:01:48  0:03:44  165k
 32 61.4M   32 20.0M    0     0   187k      0  0:05:35  0:01:49  0:03:46  128k
 32 61.4M   32 20.2M    0     0   187k      0  0:05:36  0:01:50  0:03:46  128k
 33 61.4M   33 20.4M    0     0   187k      0  0:05:35  0:01:51  0:03:44  124k
 33 61.4M   33 20.6M    0     0   188k      0  0:05:34  0:01:52  0:03:42  144k
 34 61.4M   34 20.9M    0     0   188k      0  0:05:33  0:01:53  0:03:40  176k
 34 61.4M   34 21.2M    0     0   189k      0  0:05:31  0:01:54  0:03:37  229k
 34 61.4M   34 21.3M    0     0   189k      0  0:05:32  0:01:55  0:03:37  233k
 35 61.4M   35 21.5M    0     0   189k      0  0:05:31  0:01:56  0:03:35  236k
 35 61.4M   35 21.6M    0     0   188k      0  0:05:33  0:01:57  0:03:36  200k
 35 61.4M   35 21.7M    0     0   188k      0  0:05:34  0:01:58  0:03:36  179k
 35 61.4M   35 21.9M    0     0   187k      0  0:05:35  0:01:59  0:03:36  141k
 35 61.4M   35 22.0M    0     0   187k      0  0:05:35  0:02:00  0:03:35  150k
 36 61.4M   36 22.2M    0     0   187k      0  0:05:35  0:02:01  0:03:34  147k
 36 61.4M   36 22.4M    0     0   187k      0  0:05:35  0:02:02  0:03:33  161k
 36 61.4M   36 22.6M    0     0   187k      0  0:05:35  0:02:03  0:03:32  172k
 37 61.4M   37 22.8M    0     0   188k      0  0:05:34  0:02:04  0:03:30  202k
 37 61.4M   37 23.2M    0     0   189k      0  0:05:32  0:02:05  0:03:27  236k
 38 61.4M   38 23.6M    0     0   191k      0  0:05:29  0:02:06  0:03:23  268k
 39 61.4M   39 24.1M    0     0   193k      0  0:05:24  0:02:07  0:03:17  345k
 40 61.4M   40 24.6M    0     0   196k      0  0:05:20  0:02:08  0:03:12  419k
 40 61.4M   40 24.8M    0     0   196k      0  0:05:20  0:02:09  0:03:11  391k
 40 61.4M   40 24.9M    0     0   195k      0  0:05:22  0:02:10  0:03:12  342k
 40 61.4M   40 25.0M    0     0   194k      0  0:05:23  0:02:11  0:03:12  284k
 40 61.4M   40 25.1M    0     0   194k      0  0:05:24  0:02:12  0:03:12  207k
 41 61.4M   41 25.3M    0     0   194k      0  0:05:23  0:02:13  0:03:10  137k
 41 61.4M   41 25.6M    0     0   195k      0  0:05:22  0:02:14  0:03:08  161k
 42 61.4M   42 26.0M    0     0   196k      0  0:05:19  0:02:15  0:03:04  232k
 43 61.4M   43 26.5M    0     0   198k      0  0:05:16  0:02:16  0:03:00  307k
 43 61.4M   43 27.0M    0     0   200k      0  0:05:13  0:02:17  0:02:56  381k
 44 61.4M   44 27.5M    0     0   203k      0  0:05:08  0:02:18  0:02:50  457k
 46 61.4M   46 28.2M    0     0   207k      0  0:05:03  0:02:19  0:02:44  538k
 46 61.4M   46 28.6M    0     0   208k      0  0:05:01  0:02:20  0:02:41  527k
 47 61.4M   47 28.9M    0     0   209k      0  0:05:00  0:02:21  0:02:39  505k
 47 61.4M   47 29.2M    0     0   210k      0  0:04:59  0:02:22  0:02:37  464k
 48 61.4M   48 29.8M    0     0   212k      0  0:04:55  0:02:23  0:02:32  466k
 49 61.4M   49 30.2M    0     0   214k      0  0:04:53  0:02:24  0:02:29  403k
 49 61.4M   49 30.5M    0     0   215k      0  0:04:52  0:02:25  0:02:27  400k
 50 61.4M   50 30.8M    0     0   215k      0  0:04:51  0:02:26  0:02:25  389k
 50 61.4M   50 31.2M    0     0   216k      0  0:04:50  0:02:27  0:02:23  402k
 51 61.4M   51 31.6M    0     0   218k      0  0:04:48  0:02:28  0:02:20  371k
 52 61.4M   52 32.2M    0     0   220k      0  0:04:44  0:02:29  0:02:15  415k
 53 61.4M   53 32.8M    0     0   223k      0  0:04:41  0:02:30  0:02:11  456k
 54 61.4M   54 33.2M    0     0   224k      0  0:04:40  0:02:31  0:02:09  487k
 54 61.4M   54 33.7M    0     0   226k      0  0:04:38  0:02:32  0:02:06  506k
 55 61.4M   55 33.9M    0     0   226k      0  0:04:37  0:02:33  0:02:04  469k
 55 61.4M   55 34.0M    0     0   225k      0  0:04:38  0:02:34  0:02:04  367k
 55 61.4M   55 34.2M    0     0   225k      0  0:04:39  0:02:35  0:02:04  289k
 55 61.4M   55 34.3M    0     0   224k      0  0:04:40  0:02:36  0:02:04  225k
 56 61.4M   56 34.5M    0     0   224k      0  0:04:40  0:02:37  0:02:03  165k
 56 61.4M   56 34.6M    0     0   223k      0  0:04:41  0:02:38  0:02:03  137k
 56 61.4M   56 34.7M    0     0   223k      0  0:04:42  0:02:39  0:02:03  138k
 56 61.4M   56 34.8M    0     0   222k      0  0:04:42  0:02:40  0:02:02  134k
 57 61.4M   57 35.0M    0     0   222k      0  0:04:42  0:02:41  0:02:01  151k
 57 61.4M   57 35.2M    0     0   222k      0  0:04:43  0:02:42  0:02:01  154k
 57 61.4M   57 35.4M    0     0   221k      0  0:04:43  0:02:43  0:02:00  160k
 57 61.4M   57 35.5M    0     0   221k      0  0:04:44  0:02:44  0:02:00  169k
 58 61.4M   58 35.8M    0     0   221k      0  0:04:44  0:02:45  0:01:59  186k
 58 61.4M   58 36.0M    0     0   221k      0  0:04:43  0:02:46  0:01:57  201k
 58 61.4M   58 36.2M    0     0   221k      0  0:04:44  0:02:47  0:01:57  192k
 59 61.4M   59 36.2M    0     0   220k      0  0:04:45  0:02:48  0:01:57  172k
 59 61.4M   59 36.3M    0     0   219k      0  0:04:46  0:02:49  0:01:57  154k
 59 61.4M   59 36.3M    0     0   218k      0  0:04:48  0:02:50  0:01:58  118k
 59 61.4M   59 36.4M    0     0   217k      0  0:04:49  0:02:51  0:01:58 75038
 59 61.4M   59 36.4M    0     0   216k      0  0:04:50  0:02:52  0:01:58 59014
 59 61.4M   59 36.5M    0     0   215k      0  0:04:51  0:02:53  0:01:58 58187
 59 61.4M   59 36.5M    0     0   214k      0  0:04:53  0:02:54  0:01:59 55681
 59 61.4M   59 36.6M    0     0   213k      0  0:04:54  0:02:55  0:01:59 59442
 59 61.4M   59 36.7M    0     0   213k      0  0:04:55  0:02:56  0:01:59 68879
 60 61.4M   60 36.9M    0     0   212k      0  0:04:55  0:02:57  0:01:58 88493
 60 61.4M   60 37.0M    0     0   212k      0  0:04:56  0:02:58  0:01:58 96134
 60 61.4M   60 37.0M    0     0   211k      0  0:04:57  0:02:59  0:01:58 94571
 60 61.4M   60 37.1M    0     0   210k      0  0:04:58  0:03:00  0:01:58  102k
 60 61.4M   60 37.3M    0     0   210k      0  0:04:58  0:03:01  0:01:57  128k
 61 61.4M   61 37.5M    0     0   210k      0  0:04:58  0:03:02  0:01:56  135k
 61 61.4M   61 37.7M    0     0   210k      0  0:04:58  0:03:03  0:01:55  159k
 61 61.4M   61 37.9M    0     0   210k      0  0:04:58  0:03:04  0:01:54  183k
 62 61.4M   62 38.1M    0     0   210k      0  0:04:59  0:03:05  0:01:54  195k
 62 61.4M   62 38.4M    0     0   210k      0  0:04:58  0:03:06  0:01:52  201k
 62 61.4M   62 38.5M    0     0   210k      0  0:04:58  0:03:07  0:01:51  202k
 62 61.4M   62 38.6M    0     0   209k      0  0:04:59  0:03:08  0:01:51  178k
 62 61.4M   62 38.6M    0     0   208k      0  0:05:01  0:03:09  0:01:52  145k
 63 61.4M   63 38.7M    0     0   208k      0  0:05:02  0:03:10  0:01:52  121k
 63 61.4M   63 38.7M    0     0   206k      0  0:05:04  0:03:12  0:01:52 71290
 63 61.4M   63 38.7M    0     0   206k      0  0:05:05  0:03:12  0:01:53 44910
 63 61.4M   63 38.7M    0     0   205k      0  0:05:06  0:03:13  0:01:53 32692
 63 61.4M   63 38.8M    0     0   204k      0  0:05:07  0:03:14  0:01:53 46248
 63 61.4M   63 38.9M    0     0   203k      0  0:05:08  0:03:15  0:01:53 42656
 63 61.4M   63 38.9M    0     0   203k      0  0:05:09  0:03:16  0:01:53 50386
 63 61.4M   63 39.0M    0     0   202k      0  0:05:11  0:03:17  0:01:54 54028
 63 61.4M   63 39.0M    0     0   201k      0  0:05:12  0:03:18  0:01:54 59365
 63 61.4M   63 39.1M    0     0   200k      0  0:05:14  0:03:19  0:01:55 46531
 63 61.4M   63 39.1M    0     0   199k      0  0:05:15  0:03:20  0:01:55 38520
 63 61.4M   63 39.1M    0     0   198k      0  0:05:16  0:03:21  0:01:55 29609
 63 61.4M   63 39.2M    0     0   198k      0  0:05:17  0:03:22  0:01:55 39025
 63 61.4M   63 39.2M    0     0   197k      0  0:05:18  0:03:23  0:01:55 39188
 63 61.4M   63 39.2M    0     0   196k      0  0:05:20  0:03:24  0:01:56 40562
 64 61.4M   64 39.3M    0     0   195k      0  0:05:21  0:03:25  0:01:56 50208
 64 61.4M   64 39.4M    0     0   195k      0  0:05:22  0:03:26  0:01:56 61956
 64 61.4M   64 39.5M    0     0   194k      0  0:05:22  0:03:27  0:01:55 67356
 64 61.4M   64 39.5M    0     0   194k      0  0:05:24  0:03:28  0:01:56 64188
 64 61.4M   64 39.6M    0     0   193k      0  0:05:25  0:03:29  0:01:56 66791
 64 61.4M   64 39.6M    0     0   192k      0  0:05:26  0:03:30  0:01:56 68673
 64 61.4M   64 39.7M    0     0   192k      0  0:05:27  0:03:31  0:01:56 62448
 64 61.4M   64 39.7M    0     0   191k      0  0:05:28  0:03:32  0:01:56 47427
 64 61.4M   64 39.7M    0     0   190k      0  0:05:29  0:03:33  0:01:56 46400
 64 61.4M   64 39.8M    0     0   190k      0  0:05:31  0:03:34  0:01:57 46332
 64 61.4M   64 39.8M    0     0   189k      0  0:05:32  0:03:35  0:01:57 42706
 65 61.4M   65 40.0M    0     0   189k      0  0:05:32  0:03:36  0:01:56 65418
 65 61.4M   65 40.1M    0     0   188k      0  0:05:33  0:03:37  0:01:56 67647
 65 61.4M   65 40.1M    0     0   188k      0  0:05:34  0:03:38  0:01:56 79776
 65 61.4M   65 40.3M    0     0   188k      0  0:05:34  0:03:39  0:01:55  102k
 65 61.4M   65 40.4M    0     0   187k      0  0:05:35  0:03:40  0:01:55  111k
 65 61.4M   65 40.5M    0     0   187k      0  0:05:36  0:03:41  0:01:55   99k
 66 61.4M   66 40.6M    0     0   186k      0  0:05:36  0:03:42  0:01:54  111k
 66 61.4M   66 40.7M    0     0   186k      0  0:05:37  0:03:43  0:01:54  111k
 66 61.4M   66 40.7M    0     0   186k      0  0:05:38  0:03:44  0:01:54 98359
 66 61.4M   66 40.9M    0     0   185k      0  0:05:38  0:03:45  0:01:53  112k
 67 61.4M   67 41.2M    0     0   186k      0  0:05:37  0:03:46  0:01:51  143k
 67 61.4M   67 41.5M    0     0   187k      0  0:05:36  0:03:47  0:01:49  201k
 68 61.4M   68 41.9M    0     0   188k      0  0:05:34  0:03:48  0:01:46  258k
 68 61.4M   68 42.2M    0     0   188k      0  0:05:34  0:03:49  0:01:45  287k
 68 61.4M   68 42.3M    0     0   188k      0  0:05:34  0:03:50  0:01:44  287k
 69 61.4M   69 42.5M    0     0   188k      0  0:05:34  0:03:51  0:01:43  274k
 69 61.4M   69 42.7M    0     0   188k      0  0:05:34  0:03:52  0:01:42  233k
 69 61.4M   69 42.9M    0     0   188k      0  0:05:34  0:03:53  0:01:41  193k
 70 61.4M   70 43.1M    0     0   188k      0  0:05:34  0:03:54  0:01:40  184k
 70 61.4M   70 43.2M    0     0   187k      0  0:05:34  0:03:55  0:01:39  177k
 70 61.4M   70 43.4M    0     0   187k      0  0:05:34  0:03:56  0:01:38  175k
 70 61.4M   70 43.6M    0     0   188k      0  0:05:34  0:03:57  0:01:37  182k
 71 61.4M   71 43.9M    0     0   188k      0  0:05:33  0:03:58  0:01:35  204k
 72 61.4M   72 44.2M    0     0   189k      0  0:05:32  0:03:59  0:01:33  241k
 72 61.4M   72 44.5M    0     0   189k      0  0:05:32  0:04:00  0:01:32  260k
 72 61.4M   72 44.6M    0     0   189k      0  0:05:32  0:04:01  0:01:31  250k
 72 61.4M   72 44.7M    0     0   188k      0  0:05:33  0:04:02  0:01:31  230k
 73 61.4M   73 44.9M    0     0   188k      0  0:05:33  0:04:03  0:01:30  201k
 73 61.4M   73 45.0M    0     0   188k      0  0:05:33  0:04:04  0:01:29  153k
 73 61.4M   73 45.1M    0     0   188k      0  0:05:34  0:04:05  0:01:29  133k
 73 61.4M   73 45.4M    0     0   188k      0  0:05:33  0:04:06  0:01:27  155k
 74 61.4M   74 45.5M    0     0   188k      0  0:05:34  0:04:07  0:01:27  159k
 74 61.4M   74 45.6M    0     0   187k      0  0:05:34  0:04:08  0:01:26  146k
 74 61.4M   74 45.7M    0     0   187k      0  0:05:35  0:04:09  0:01:26  143k
 74 61.4M   74 45.9M    0     0   187k      0  0:05:35  0:04:10  0:01:25  153k
 74 61.4M   74 46.0M    0     0   187k      0  0:05:35  0:04:11  0:01:24  131k
 75 61.4M   75 46.1M    0     0   186k      0  0:05:36  0:04:12  0:01:24  118k
 75 61.4M   75 46.1M    0     0   186k      0  0:05:37  0:04:13  0:01:24  106k
 75 61.4M   75 46.2M    0     0   185k      0  0:05:38  0:04:14  0:01:24   98k
 75 61.4M   75 46.2M    0     0   185k      0  0:05:39  0:04:15  0:01:24 72237
 75 61.4M   75 46.2M    0     0   184k      0  0:05:40  0:04:16  0:01:24 51961
 75 61.4M   75 46.3M    0     0   184k      0  0:05:41  0:04:17  0:01:24 55699
 75 61.4M   75 46.5M    0     0   184k      0  0:05:41  0:04:18  0:01:23 89892
 76 61.4M   76 46.8M    0     0   184k      0  0:05:40  0:04:19  0:01:21  138k
 76 61.4M   76 47.1M    0     0   185k      0  0:05:39  0:04:20  0:01:19  172k
 76 61.4M   76 47.2M    0     0   185k      0  0:05:40  0:04:21  0:01:19  201k
 77 61.4M   77 47.3M    0     0   184k      0  0:05:40  0:04:22  0:01:18  203k
 77 61.4M   77 47.4M    0     0   184k      0  0:05:41  0:04:23  0:01:18  185k
 77 61.4M   77 47.5M    0     0   184k      0  0:05:41  0:04:24  0:01:17  140k
 77 61.4M   77 47.6M    0     0   183k      0  0:05:42  0:04:25  0:01:17  101k
 77 61.4M   77 47.6M    0     0   183k      0  0:05:43  0:04:26  0:01:17 85091
 77 61.4M   77 47.7M    0     0   182k      0  0:05:44  0:04:27  0:01:17 78752
 77 61.4M   77 47.7M    0     0   182k      0  0:05:45  0:04:28  0:01:17 65457
 77 61.4M   77 47.8M    0     0   181k      0  0:05:46  0:04:29  0:01:17 52271
 77 61.4M   77 47.9M    0     0   181k      0  0:05:46  0:04:30  0:01:16 65680
 78 61.4M   78 48.0M    0     0   181k      0  0:05:47  0:04:31  0:01:16 75458
 78 61.4M   78 48.3M    0     0   181k      0  0:05:46  0:04:32  0:01:14  134k
 79 61.4M   79 48.7M    0     0   182k      0  0:05:44  0:04:33  0:01:11  204k
 79 61.4M   79 48.9M    0     0   182k      0  0:05:44  0:04:34  0:01:10  227k
 79 61.4M   79 49.1M    0     0   182k      0  0:05:44  0:04:35  0:01:09  246k
 80 61.4M   80 49.4M    0     0   183k      0  0:05:43  0:04:36  0:01:07  287k
 80 61.4M   80 49.6M    0     0   183k      0  0:05:43  0:04:37  0:01:06  262k
 81 61.4M   81 49.9M    0     0   183k      0  0:05:42  0:04:38  0:01:04  230k
 81 61.4M   81 50.3M    0     0   184k      0  0:05:41  0:04:39  0:01:02  284k
 82 61.4M   82 50.5M    0     0   184k      0  0:05:41  0:04:40  0:01:01  289k
 82 61.4M   82 50.5M    0     0   183k      0  0:05:42  0:04:41  0:01:01  235k
 82 61.4M   82 50.6M    0     0   183k      0  0:05:42  0:04:42  0:01:00  201k
 82 61.4M   82 50.7M    0     0   183k      0  0:05:43  0:04:43  0:01:00  162k
 82 61.4M   82 50.7M    0     0   182k      0  0:05:44  0:04:44  0:01:00 91750
 82 61.4M   82 50.8M    0     0   182k      0  0:05:45  0:04:45  0:01:00 59211
 82 61.4M   82 50.9M    0     0   182k      0  0:05:45  0:04:46  0:00:59 75670
 83 61.4M   83 51.0M    0     0   181k      0  0:05:46  0:04:47  0:00:59 84057
 83 61.4M   83 51.1M    0     0   181k      0  0:05:46  0:04:48  0:00:58 85439
 83 61.4M   83 51.1M    0     0   181k      0  0:05:47  0:04:49  0:00:58 88140
 83 61.4M   83 51.2M    0     0   180k      0  0:05:48  0:04:50  0:00:58 98261
 83 61.4M   83 51.3M    0     0   180k      0  0:05:48  0:04:51  0:00:57 91805
 83 61.4M   83 51.5M    0     0   180k      0  0:05:48  0:04:52  0:00:56  103k
 84 61.4M   84 51.6M    0     0   180k      0  0:05:49  0:04:53  0:00:56  111k
 84 61.4M   84 51.7M    0     0   180k      0  0:05:49  0:04:54  0:00:55  121k
 84 61.4M   84 51.9M    0     0   180k      0  0:05:49  0:04:55  0:00:54  140k
 84 61.4M   84 52.2M    0     0   180k      0  0:05:49  0:04:56  0:00:53  169k
 85 61.4M   85 52.4M    0     0   180k      0  0:05:48  0:04:57  0:00:51  189k
 85 61.4M   85 52.6M    0     0   180k      0  0:05:48  0:04:58  0:00:50  205k
 86 61.4M   86 52.8M    0     0   180k      0  0:05:48  0:04:59  0:00:49  224k
 86 61.4M   86 53.1M    0     0   181k      0  0:05:47  0:05:00  0:00:47  239k
 86 61.4M   86 53.2M    0     0   180k      0  0:05:48  0:05:01  0:00:47  208k
 86 61.4M   86 53.3M    0     0   180k      0  0:05:48  0:05:02  0:00:46  172k
 87 61.4M   87 53.5M    0     0   180k      0  0:05:48  0:05:03  0:00:45  176k
 87 61.4M   87 53.8M    0     0   180k      0  0:05:47  0:05:04  0:00:43  192k
 87 61.4M   87 54.0M    0     0   180k      0  0:05:47  0:05:05  0:00:42  171k
 88 61.4M   88 54.1M    0     0   180k      0  0:05:47  0:05:06  0:00:41  187k
 88 61.4M   88 54.2M    0     0   180k      0  0:05:48  0:05:07  0:00:41  195k
 88 61.4M   88 54.4M    0     0   180k      0  0:05:48  0:05:08  0:00:40  191k
 88 61.4M   88 54.6M    0     0   180k      0  0:05:47  0:05:09  0:00:38  175k
 89 61.4M   89 55.0M    0     0   181k      0  0:05:46  0:05:10  0:00:36  211k
 90 61.4M   90 55.3M    0     0   182k      0  0:05:45  0:05:11  0:00:34  250k
 90 61.4M   90 55.6M    0     0   182k      0  0:05:45  0:05:12  0:00:33  285k
 91 61.4M   91 55.9M    0     0   182k      0  0:05:44  0:05:13  0:00:31  301k
 91 61.4M   91 56.1M    0     0   182k      0  0:05:44  0:05:14  0:00:30  297k
 91 61.4M   91 56.3M    0     0   182k      0  0:05:44  0:05:15  0:00:29  271k
 92 61.4M   92 56.6M    0     0   183k      0  0:05:43  0:05:16  0:00:27  259k
 92 61.4M   92 56.9M    0     0   183k      0  0:05:42  0:05:17  0:00:25  262k
 93 61.4M   93 57.2M    0     0   184k      0  0:05:41  0:05:18  0:00:23  275k
 93 61.4M   93 57.5M    0     0   184k      0  0:05:41  0:05:19  0:00:22  288k
 94 61.4M   94 57.8M    0     0   184k      0  0:05:40  0:05:20  0:00:20  303k
 94 61.4M   94 58.3M    0     0   185k      0  0:05:38  0:05:21  0:00:17  342k
 95 61.4M   95 58.7M    0     0   186k      0  0:05:37  0:05:22  0:00:15  371k
 96 61.4M   96 59.3M    0     0   187k      0  0:05:35  0:05:23  0:00:12  416k
 97 61.4M   97 59.8M    0     0   188k      0  0:05:33  0:05:24  0:00:09  473k
 98 61.4M   98 60.4M    0     0   190k      0  0:05:30  0:05:25  0:00:05  541k
 99 61.4M   99 61.2M    0     0   191k      0  0:05:27  0:05:26  0:00:01  591k
100 61.4M  100 61.4M    0     0   192k      0  0:05:27  0:05:27 --:--:--  617k

Extract the archive file into the tutorial directory:

bash
tar xf tutorial/data.tar.gz --directory=tutorial

The software we are going to use in this tutorial can be installed using the conda package manager. Please refer to the previous conda workshop for details on installing software and creating conda environments.

3.1 Install FastQC

Create a new environment with FastQC installed:

bash
conda create --yes --name fastqc fastqc
## Collecting package metadata (current_repodata.json): ...working... done
## Solving environment: ...working... done
## 
## ## Package Plan ##
## 
##   environment location: /opt/miniconda3/envs/fastqc
## 
##   added / updated specs:
##     - fastqc
## 
## 
## The following NEW packages will be INSTALLED:
## 
##   expat              conda-forge/osx-64::expat-2.4.9-hf0c8a7f_0
##   fastqc             bioconda/noarch::fastqc-0.11.9-hdfd78af_1
##   font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0
##   fontconfig         conda-forge/osx-64::fontconfig-2.14.0-h5bb23bf_1
##   freetype           conda-forge/osx-64::freetype-2.12.1-h3f81eb7_0
##   libcxx             conda-forge/osx-64::libcxx-14.0.6-hccf4f1f_0
##   libpng             conda-forge/osx-64::libpng-1.6.38-ha978bb4_0
##   libzlib            conda-forge/osx-64::libzlib-1.2.12-hfd90126_3
##   openjdk            conda-forge/osx-64::openjdk-17.0.3-hbc0c0cd_2
##   perl               conda-forge/osx-64::perl-5.32.1-2_h0d85af4_perl5
## 
## 
## Preparing transaction: ...working... done
## Verifying transaction: ...working... done
## Executing transaction: ...working... done
## #
## # To activate this environment, use
## #
## #     $ conda activate fastqc
## #
## # To deactivate an active environment, use
## #
## #     $ conda deactivate
## 
## Retrieving notices: ...working... done

Activate the new environment to use it:

bash
conda activate fastqc

Test that the fastqc command is available:

bash
which fastqc
## /opt/miniconda3/envs/fastqc/bin/fastqc

3.1.1 Install Cutadapt

Create a new environment with Cutadapt installed:

bash
conda create --yes --name cutadapt cutadapt
## Collecting package metadata (current_repodata.json): ...working... done
## Solving environment: ...working... done
## 
## ## Package Plan ##
## 
##   environment location: /opt/miniconda3/envs/cutadapt
## 
##   added / updated specs:
##     - cutadapt
## 
## 
## The following NEW packages will be INSTALLED:
## 
##   bzip2              conda-forge/osx-64::bzip2-1.0.8-h0d85af4_4
##   ca-certificates    conda-forge/osx-64::ca-certificates-2022.9.14-h033912b_0
##   cutadapt           bioconda/osx-64::cutadapt-4.1-py310hd528523_1
##   dnaio              bioconda/osx-64::dnaio-0.9.1-py310hd528523_1
##   isa-l              conda-forge/osx-64::isa-l-2.30.0-h0d85af4_4
##   libffi             conda-forge/osx-64::libffi-3.4.2-h0d85af4_5
##   libsqlite          conda-forge/osx-64::libsqlite-3.39.3-ha978bb4_0
##   libzlib            conda-forge/osx-64::libzlib-1.2.12-hfd90126_3
##   ncurses            conda-forge/osx-64::ncurses-6.3-h96cf925_1
##   openssl            conda-forge/osx-64::openssl-3.0.5-hfd90126_2
##   pbzip2             conda-forge/osx-64::pbzip2-1.1.13-h9d27c22_1
##   pigz               conda-forge/osx-64::pigz-2.6-h5dbffcc_0
##   pip                conda-forge/noarch::pip-22.2.2-pyhd8ed1ab_0
##   python             conda-forge/osx-64::python-3.10.6-hc14f532_0_cpython
##   python-isal        conda-forge/osx-64::python-isal-1.0.1-py310h3c08dca_0
##   python_abi         conda-forge/osx-64::python_abi-3.10-2_cp310
##   readline           conda-forge/osx-64::readline-8.1.2-h3899abd_0
##   setuptools         conda-forge/noarch::setuptools-65.3.0-pyhd8ed1ab_1
##   tk                 conda-forge/osx-64::tk-8.6.12-h5dbffcc_0
##   tzdata             conda-forge/noarch::tzdata-2022c-h191b570_0
##   wheel              conda-forge/noarch::wheel-0.37.1-pyhd8ed1ab_0
##   xopen              conda-forge/osx-64::xopen-1.6.0-py310h2ec42d9_0
##   xz                 conda-forge/osx-64::xz-5.2.6-h775f41a_0
##   zlib               conda-forge/osx-64::zlib-1.2.12-hfd90126_3
## 
## 
## Preparing transaction: ...working... done
## Verifying transaction: ...working... done
## Executing transaction: ...working... done
## #
## # To activate this environment, use
## #
## #     $ conda activate cutadapt
## #
## # To deactivate an active environment, use
## #
## #     $ conda deactivate
## 
## Retrieving notices: ...working... done

Activate the new environment to use it:

bash
conda activate cutadapt

Test that the cutadapt command is available:

bash
which cutadapt
## /opt/miniconda3/envs/cutadapt/bin/cutadapt

3.2 Sequence quality

Quality control is an important step in the processing of any type of data. It tells the researcher whether the data is suitable for analysis and whether the conclusions drawn can be trusted. Data integrity however is just one part of the quality control procedure and many other issues, outside of the data, may remain.

In the example data directory are two FASTQ files. The ‘good’ file has no quality problems and passes all of the FastQC modules. The ‘bad’ file has lots of problems and would not be suitable for data analysis. Lets run FastQC on both files and compare the results.

Activate the fastqc environment:

bash
conda activate fastqc

Print the help information for the fastqc command:

bash
fastqc --help
## 
##             FastQC - A high throughput sequence QC analysis tool
## 
## SYNOPSIS
## 
##  fastqc seqfile1 seqfile2 .. seqfileN
## 
##     fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] 
##            [-c contaminant file] seqfile1 .. seqfileN
## 
## DESCRIPTION
## 
##     FastQC reads a set of sequence files and produces from each one a quality
##     control report consisting of a number of different modules, each one of 
##     which will help to identify a different potential type of problem in your
##     data.
##     
##     If no files to process are specified on the command line then the program
##     will start as an interactive graphical application.  If files are provided
##     on the command line then the program will run with no user interaction
##     required.  In this mode it is suitable for inclusion into a standardised
##     analysis pipeline.
##     
##     The options for the program as as follows:
##     
##     -h --help       Print this help file and exit
##     
##     -v --version    Print the version of the program and exit
##     
##     -o --outdir     Create all output files in the specified output directory.
##                     Please note that this directory must exist as the program
##                     will not create it.  If this option is not set then the 
##                     output file for each sequence file is created in the same
##                     directory as the sequence file which was processed.
##                     
##     --casava        Files come from raw casava output. Files in the same sample
##                     group (differing only by the group number) will be analysed
##                     as a set rather than individually. Sequences with the filter
##                     flag set in the header will be excluded from the analysis.
##                     Files must have the same names given to them by casava
##                     (including being gzipped and ending with .gz) otherwise they
##                     won't be grouped together correctly.
##                     
##     --nano          Files come from nanopore sequences and are in fast5 format. In
##                     this mode you can pass in directories to process and the program
##                     will take in all fast5 files within those directories and produce
##                     a single output file from the sequences found in all files.                    
##                     
##     --nofilter      If running with --casava then don't remove read flagged by
##                     casava as poor quality when performing the QC analysis.
##                    
##     --extract       If set then the zipped output file will be uncompressed in
##                     the same directory after it has been created.  By default
##                     this option will be set if fastqc is run in non-interactive
##                     mode.
##                     
##     -j --java       Provides the full path to the java binary you want to use to
##                     launch fastqc. If not supplied then java is assumed to be in
##                     your path.
##                    
##     --noextract     Do not uncompress the output file after creating it.  You
##                     should set this option if you do not wish to uncompress
##                     the output when running in non-interactive mode.
##                     
##     --nogroup       Disable grouping of bases for reads >50bp. All reports will
##                     show data for every base in the read.  WARNING: Using this
##                     option will cause fastqc to crash and burn if you use it on
##                     really long reads, and your plots may end up a ridiculous size.
##                     You have been warned!
##                     
##     --min_length    Sets an artificial lower limit on the length of the sequence
##                     to be shown in the report.  As long as you set this to a value
##                     greater or equal to your longest read length then this will be
##                     the sequence length used to create your read groups.  This can
##                     be useful for making directly comaparable statistics from 
##                     datasets with somewhat variable read lengths.
##                     
##     -f --format     Bypasses the normal sequence file format detection and
##                     forces the program to use the specified format.  Valid
##                     formats are bam,sam,bam_mapped,sam_mapped and fastq
##                     
##     -t --threads    Specifies the number of files which can be processed
##                     simultaneously.  Each thread will be allocated 250MB of
##                     memory so you shouldn't run more threads than your
##                     available memory will cope with, and not more than
##                     6 threads on a 32 bit machine
##                   
##     -c              Specifies a non-default file which contains the list of
##     --contaminants  contaminants to screen overrepresented sequences against.
##                     The file must contain sets of named contaminants in the
##                     form name[tab]sequence.  Lines prefixed with a hash will
##                     be ignored.
## 
##     -a              Specifies a non-default file which contains the list of
##     --adapters      adapter sequences which will be explicity searched against
##                     the library. The file must contain sets of named adapters
##                     in the form name[tab]sequence.  Lines prefixed with a hash
##                     will be ignored.
##                     
##     -l              Specifies a non-default file which contains a set of criteria
##     --limits        which will be used to determine the warn/error limits for the
##                     various modules.  This file can also be used to selectively 
##                     remove some modules from the output all together.  The format
##                     needs to mirror the default limits.txt file found in the
##                     Configuration folder.
##                     
##    -k --kmers       Specifies the length of Kmer to look for in the Kmer content
##                     module. Specified Kmer length must be between 2 and 10. Default
##                     length is 7 if not specified.
##                     
##    -q --quiet       Supress all progress messages on stdout and only report errors.
##    
##    -d --dir         Selects a directory to be used for temporary files written when
##                     generating report images. Defaults to system temp directory if
##                     not specified.
##                     
## BUGS
## 
##     Any bugs in fastqc should be reported either to simon.andrews@babraham.ac.uk
##     or in www.bioinformatics.babraham.ac.uk/bugzilla/
##                    
## 

Create a fastqc output directory:

bash
mkdir tutorial/fastqc

Run FastQC to generate a quality control report for the ‘good’ data:

bash
fastqc -o tutorial/fastqc tutorial/data/good.fastq.gz
## Started analysis of good.fastq.gz
## Approx 5% complete for good.fastq.gz
## Approx 10% complete for good.fastq.gz
## Approx 15% complete for good.fastq.gz
## Approx 20% complete for good.fastq.gz
## Approx 25% complete for good.fastq.gz
## Approx 30% complete for good.fastq.gz
## Approx 35% complete for good.fastq.gz
## Approx 40% complete for good.fastq.gz
## Approx 45% complete for good.fastq.gz
## Approx 50% complete for good.fastq.gz
## Approx 55% complete for good.fastq.gz
## Approx 60% complete for good.fastq.gz
## Approx 65% complete for good.fastq.gz
## Approx 70% complete for good.fastq.gz
## Approx 75% complete for good.fastq.gz
## Approx 80% complete for good.fastq.gz
## Approx 85% complete for good.fastq.gz
## Approx 90% complete for good.fastq.gz
## Approx 95% complete for good.fastq.gz
## Approx 100% complete for good.fastq.gz
## Analysis complete for good.fastq.gz

Inspect the fastqc output directory:

bash
ls tutorial/fastqc
## good_fastqc.html
## good_fastqc.zip

You should see that FastQC generates a .html report and a .zip folder. The HTML report contains the results from a number of tests which FastQC runs on the sequencing data. For an explanation of the tests, please refer to the FastQC documentation for each one:

Inspect the quality control report for the ‘good’ data:

Output

The report uses a traffic light system (Green, Yellow, Red) to grade the result from each test (Pass, Warn, Fail). For this particular example, the data passed all of the tests. Once your data passes the majority of these tests it is safe to proceed with any downstream analysis or processing.

Importantly, while these tests are useful, do not rely on them to absolve you of any responsibility. Data which passes all of these tests may still exhibit biases which are not detected, and vice versa. It is important to think about what the tests are measuring and whether those measurements make sense for your particular data.

Lets now extract the .zip folder and see what it contains:

bash
unzip -o tutorial/fastqc/good_fastqc.zip -d tutorial/fastqc
## Archive:  tutorial/fastqc/good_fastqc.zip
##    creating: tutorial/fastqc/good_fastqc/
##    creating: tutorial/fastqc/good_fastqc/Icons/
##    creating: tutorial/fastqc/good_fastqc/Images/
##   inflating: tutorial/fastqc/good_fastqc/Icons/fastqc_icon.png  
##   inflating: tutorial/fastqc/good_fastqc/Icons/warning.png  
##   inflating: tutorial/fastqc/good_fastqc/Icons/error.png  
##   inflating: tutorial/fastqc/good_fastqc/Icons/tick.png  
##   inflating: tutorial/fastqc/good_fastqc/summary.txt  
##   inflating: tutorial/fastqc/good_fastqc/Images/per_base_quality.png  
##   inflating: tutorial/fastqc/good_fastqc/Images/per_tile_quality.png  
##   inflating: tutorial/fastqc/good_fastqc/Images/per_sequence_quality.png  
##   inflating: tutorial/fastqc/good_fastqc/Images/per_base_sequence_content.png  
##   inflating: tutorial/fastqc/good_fastqc/Images/per_sequence_gc_content.png  
##   inflating: tutorial/fastqc/good_fastqc/Images/per_base_n_content.png  
##   inflating: tutorial/fastqc/good_fastqc/Images/sequence_length_distribution.png  
##   inflating: tutorial/fastqc/good_fastqc/Images/duplication_levels.png  
##   inflating: tutorial/fastqc/good_fastqc/Images/adapter_content.png  
##   inflating: tutorial/fastqc/good_fastqc/fastqc_report.html  
##   inflating: tutorial/fastqc/good_fastqc/fastqc_data.txt  
##   inflating: tutorial/fastqc/good_fastqc/fastqc.fo

List all files in the extracted folder:

bash
ls tutorial/fastqc/good_fastqc
## Icons
## Images
## fastqc.fo
## fastqc_data.txt
## fastqc_report.html
## summary.txt

The folder contains images and supplementary data. The images are those present in the .html report so we don’t need to look at these again. The others we will look at are the fastqc_data.txt and summary.txt files.

First, lets display the contents of the fastqc_data.txt file:

bash
cat tutorial/fastqc/good_fastqc/fastqc_data.txt
## ##FastQC 0.11.9
## >>Basic Statistics   pass
## #Measure Value
## Filename good.fastq.gz
## File type    Conventional base calls
## Encoding Illumina 1.5
## Total Sequences  250000
## Sequences flagged as poor quality    0
## Sequence length  40
## %GC  45
## >>END_MODULE
## >>Per base sequence quality  pass
## #Base    Mean    Median  Lower Quartile  Upper Quartile  10th Percentile 90th Percentile
## 1    37.018596   38.0    38.0    38.0    35.0    38.0
## 2    36.964208   38.0    37.0    38.0    35.0    38.0
## 3    37.128464   38.0    37.0    38.0    35.0    38.0
## 4    37.111792   38.0    37.0    38.0    35.0    38.0
## 5    37.154904   38.0    38.0    38.0    36.0    38.0
## 6    37.205776   38.0    38.0    38.0    36.0    38.0
## 7    37.206184   38.0    38.0    38.0    36.0    38.0
## 8    37.169392   38.0    38.0    38.0    36.0    38.0
## 9    37.15452    38.0    38.0    38.0    35.0    38.0
## 10   37.126808   38.0    37.0    38.0    35.0    38.0
## 11   36.82764    38.0    37.0    38.0    35.0    38.0
## 12   36.893108   38.0    37.0    38.0    35.0    38.0
## 13   36.979468   38.0    37.0    38.0    35.0    38.0
## 14   36.996272   38.0    37.0    38.0    35.0    38.0
## 15   36.903852   38.0    37.0    38.0    35.0    38.0
## 16   37.063328   38.0    37.0    38.0    35.0    38.0
## 17   37.044728   38.0    37.0    38.0    35.0    38.0
## 18   37.005588   38.0    37.0    38.0    35.0    38.0
## 19   37.023184   38.0    37.0    38.0    35.0    38.0
## 20   37.025436   38.0    37.0    38.0    35.0    38.0
## 21   36.98352    38.0    37.0    38.0    35.0    38.0
## 22   36.962568   38.0    37.0    38.0    35.0    38.0
## 23   36.94942    38.0    37.0    38.0    35.0    38.0
## 24   36.937728   38.0    37.0    38.0    35.0    38.0
## 25   36.913956   38.0    37.0    38.0    35.0    38.0
## 26   36.838288   38.0    37.0    38.0    35.0    38.0
## 27   36.810176   38.0    37.0    38.0    35.0    38.0
## 28   36.779396   38.0    37.0    38.0    35.0    38.0
## 29   36.764576   38.0    37.0    38.0    35.0    38.0
## 30   36.739352   38.0    37.0    38.0    35.0    38.0
## 31   36.680796   38.0    37.0    38.0    35.0    38.0
## 32   36.6185 38.0    37.0    38.0    35.0    38.0
## 33   36.577148   38.0    37.0    38.0    34.0    38.0
## 34   36.517688   38.0    36.0    38.0    34.0    38.0
## 35   36.448552   38.0    36.0    38.0    34.0    38.0
## 36   36.356024   38.0    36.0    38.0    34.0    38.0
## 37   36.196704   38.0    36.0    38.0    33.0    38.0
## 38   36.226052   38.0    36.0    38.0    33.0    38.0
## 39   36.107872   38.0    36.0    38.0    33.0    38.0
## 40   35.623204   37.0    36.0    38.0    32.0    38.0
## >>END_MODULE
## >>Per tile sequence quality  pass
## #Tile    Base    Mean
## 1    1   0.0
## 1    2   0.0
## 1    3   0.0
## 1    4   0.0
## 1    5   0.0
## 1    6   0.0
## 1    7   0.0
## 1    8   0.0
## 1    9   0.0
## 1    10  0.0
## 1    11  0.0
## 1    12  0.0
## 1    13  0.0
## 1    14  0.0
## 1    15  0.0
## 1    16  0.0
## 1    17  0.0
## 1    18  0.0
## 1    19  0.0
## 1    20  0.0
## 1    21  0.0
## 1    22  0.0
## 1    23  0.0
## 1    24  0.0
## 1    25  0.0
## 1    26  0.0
## 1    27  0.0
## 1    28  0.0
## 1    29  0.0
## 1    30  0.0
## 1    31  0.0
## 1    32  0.0
## 1    33  0.0
## 1    34  0.0
## 1    35  0.0
## 1    36  0.0
## 1    37  0.0
## 1    38  0.0
## 1    39  0.0
## 1    40  0.0
## >>END_MODULE
## >>Per sequence quality scores    pass
## #Quality Count
## 2    58.0
## 3    1.0
## 4    0.0
## 5    2.0
## 6    3.0
## 7    3.0
## 8    4.0
## 9    5.0
## 10   8.0
## 11   4.0
## 12   14.0
## 13   20.0
## 14   22.0
## 15   28.0
## 16   42.0
## 17   35.0
## 18   65.0
## 19   114.0
## 20   101.0
## 21   128.0
## 22   178.0
## 23   213.0
## 24   292.0
## 25   380.0
## 26   517.0
## 27   796.0
## 28   1142.0
## 29   1485.0
## 30   2171.0
## 31   2900.0
## 32   4039.0
## 33   5175.0
## 34   7369.0
## 35   11430.0
## 36   25412.0
## 37   179886.0
## 38   5958.0
## >>END_MODULE
## >>Per base sequence content  pass
## #Base    G   A   T   C
## 1    20.016690271096543  26.817975742937726  28.279979297644502  24.88535468832122
## 2    21.0868 26.802  28.2756 23.8356
## 3    21.0276 25.516  28.7004 24.756
## 4    21.5552 25.175199999999997  27.971200000000003  25.298399999999997
## 5    19.8816 25.629600000000003  29.1024 25.3864
## 6    19.877269690654014  25.550546245884654  28.133963252913247  26.438220810548085
## 7    19.584  25.311600000000002  29.321199999999997  25.7832
## 8    19.2504 25.3576 29.3064 26.0856
## 9    19.0136 25.2044 29.3124 26.4696
## 10   19.1928 25.523200000000003  28.650399999999998  26.6336
## 11   19.1492 25.7216 28.875600000000002  26.2536
## 12   19.3004 25.258399999999998  28.9476 26.4936
## 13   19.3076 25.269599999999997  29.1224 26.300400000000003
## 14   18.9756 25.4816 29.212  26.3308
## 15   18.92   25.4032 29.1968 26.479999999999997
## 16   19.0192 25.352400000000003  29.053600000000003  26.5748
## 17   18.9208 25.338  29.2972 26.444000000000003
## 18   18.9768 25.274  29.014  26.7352
## 19   18.9368 25.3268 28.871599999999997  26.8648
## 20   18.9652 25.3764 28.9264 26.732
## 21   18.9752 25.272  28.6188 27.134000000000004
## 22   19.0024 25.354  28.8404 26.8032
## 23   19.0588 25.137999999999998  28.8488 26.9544
## 24   18.9376 25.0708 28.8868 27.1048
## 25   19.016  25.163200000000003  28.9584 26.862399999999997
## 26   18.9668 25.0468 28.9968 26.989600000000003
## 27   18.9696 25.0456 28.967599999999997  27.017200000000003
## 28   18.788552775799857  25.150610843980765  28.99288748789913   27.06794889232025
## 29   18.859101745627928  25.167602681642908  28.884862157794526  27.088433414934638
## 30   18.996399999999998  25.243199999999998  28.708  27.0524
## 31   18.903200000000002  25.0704 28.792  27.234399999999997
## 32   18.9176 25.2032 28.7708 27.1084
## 33   19.056  24.9164 28.7568 27.2708
## 34   18.944  25.190800000000003  28.721200000000003  27.144000000000002
## 35   18.7776 24.898  29.1236 27.2008
## 36   18.7084 24.7668 29.164  27.3608
## 37   18.706  25.014799999999997  28.8904 27.388800000000003
## 38   18.6152 25.1504 28.898000000000003  27.3364
## 39   18.692  24.8808 28.861199999999997  27.566000000000003
## 40   18.691899070385126  25.036000576009215  28.817661082577324  27.454439271028335
## >>END_MODULE
## >>Per sequence GC content    pass
## #GC Content  Count
## 0    9.0
## 1    5.5
## 2    2.0
## 3    2.0
## 4    8.0
## 5    14.0
## 6    26.0
## 7    38.0
## 8    38.0
## 9    53.5
## 10   69.0
## 11   125.0
## 12   181.0
## 13   181.0
## 14   255.0
## 15   329.0
## 16   460.0
## 17   591.0
## 18   591.0
## 19   801.5
## 20   1012.0
## 21   1333.0
## 22   1654.0
## 23   1654.0
## 24   2291.5
## 25   2929.0
## 26   3727.5
## 27   4526.0
## 28   4526.0
## 29   6006.5
## 30   7487.0
## 31   9478.5
## 32   11470.0
## 33   11470.0
## 34   13628.0
## 35   15786.0
## 36   16785.5
## 37   17785.0
## 38   17785.0
## 39   18616.0
## 40   19447.0
## 41   20321.5
## 42   21196.0
## 43   21196.0
## 44   21714.5
## 45   22233.0
## 46   22642.0
## 47   23051.0
## 48   23051.0
## 49   23179.5
## 50   23308.0
## 51   22107.5
## 52   20907.0
## 53   20907.0
## 54   19226.5
## 55   17546.0
## 56   15523.5
## 57   13501.0
## 58   13501.0
## 59   11574.5
## 60   9648.0
## 61   8011.0
## 62   6374.0
## 63   6374.0
## 64   5222.5
## 65   4071.0
## 66   3167.0
## 67   2263.0
## 68   2263.0
## 69   1707.5
## 70   1152.0
## 71   877.0
## 72   602.0
## 73   602.0
## 74   485.5
## 75   369.0
## 76   277.0
## 77   185.0
## 78   185.0
## 79   151.0
## 80   117.0
## 81   88.0
## 82   59.0
## 83   59.0
## 84   52.5
## 85   46.0
## 86   32.5
## 87   19.0
## 88   19.0
## 89   15.5
## 90   12.0
## 91   9.0
## 92   6.0
## 93   6.0
## 94   4.5
## 95   3.0
## 96   2.0
## 97   1.0
## 98   1.0
## 99   1.5
## 100  2.0
## >>END_MODULE
## >>Per base N content pass
## #Base    N-Count
## 1    0.30119999999999997
## 2    0.0
## 3    0.0
## 4    0.0
## 5    0.0
## 6    0.0084
## 7    0.0
## 8    0.0
## 9    0.0
## 10   0.0
## 11   0.0
## 12   0.0
## 13   0.0
## 14   0.0
## 15   0.0
## 16   0.0
## 17   0.0
## 18   0.0
## 19   0.0
## 20   0.0
## 21   0.0
## 22   0.0
## 23   0.0
## 24   0.0
## 25   0.0
## 26   0.0
## 27   0.0
## 28   0.0072
## 29   0.0015999999999999999
## 30   0.0
## 31   0.0
## 32   0.0
## 33   0.0
## 34   0.0
## 35   0.0
## 36   0.0
## 37   0.0
## 38   0.0
## 39   0.0
## 40   0.0015999999999999999
## >>END_MODULE
## >>Sequence Length Distribution   pass
## #Length  Count
## 40   250000.0
## >>END_MODULE
## >>Sequence Duplication Levels    pass
## #Total Deduplicated Percentage   94.81007801740206
## #Duplication Level   Percentage of deduplicated  Percentage of total
## 1    98.82446040062165   93.695548006106
## 2    0.8184774289647235  1.5519981779125618
## 3    0.13770282308459525 0.39166846199600974
## 4    0.05196687276280436 0.1970793304384755
## 5    0.02849674855564644 0.13508894769015645
## 6    0.013177264859977599    0.07496025056582681
## 7    0.008639504778333843    0.05733784854458847
## 8    0.006845752143545901    0.05192370358539074
## 9    0.002127511777105116    0.018153860181324994
## >10  0.09248166995806788 2.7312695332740327
## >50  0.013090397224314817    0.7826947074312642
## >100 0.002533625269222222    0.31227717227436624
## >500 0.0 0.0
## >1k  0.0 0.0
## >5k  0.0 0.0
## >10k+    0.0 0.0
## >>END_MODULE
## >>Overrepresented sequences  pass
## >>END_MODULE
## >>Adapter Content    pass
## #Position    Illumina Universal Adapter  Illumina Small RNA 3' Adapter   Illumina Small RNA 5' Adapter   Nextera Transposase Sequence    SOLID Small RNA Adapter
## 1    0.0 0.0 0.0 0.0 0.0
## 2    0.0 0.0 0.0 0.0 0.0
## 3    0.0 0.0 0.0 0.0 0.0
## 4    0.0 0.0 0.0 0.0 0.0
## 5    0.0 0.0 0.0 0.0 0.0
## 6    0.0 0.0 0.0 0.0 4.0E-4
## 7    0.0 0.0 0.0 0.0 4.0E-4
## 8    0.0 0.0 0.0 0.0 4.0E-4
## 9    0.0 0.0 0.0 0.0 4.0E-4
## 10   0.0 0.0 0.0 0.0 4.0E-4
## 11   0.0 0.0 0.0 0.0 4.0E-4
## 12   0.0 0.0 0.0 0.0 4.0E-4
## 13   0.0 0.0 0.0 0.0 4.0E-4
## 14   0.0 0.0 0.0 0.0 4.0E-4
## 15   0.0 0.0 0.0 0.0 4.0E-4
## 16   0.0 0.0 0.0 0.0 4.0E-4
## 17   0.0 0.0 0.0 0.0 4.0E-4
## 18   0.0 0.0 0.0 0.0 4.0E-4
## 19   0.0 0.0 0.0 0.0 4.0E-4
## 20   0.0036  0.0 0.0 0.0 4.0E-4
## 21   0.0044  0.0 0.0 0.0 4.0E-4
## 22   0.0044  0.0 0.0 0.0 4.0E-4
## 23   0.0052  0.0 0.0 0.0 4.0E-4
## 24   0.0052  0.0 0.0 0.0 4.0E-4
## 25   0.0052  0.0 0.0 0.0 4.0E-4
## 26   0.0052  0.0 0.0 0.0 4.0E-4
## 27   0.0052  0.0 0.0 0.0 4.0E-4
## 28   0.0052  0.0 0.0 4.0E-4  4.0E-4
## 29   0.0052  0.0 0.0 4.0E-4  4.0E-4
## >>END_MODULE

As you may have guessed, this file contains all of the data FastQC calculates when running each of the test modules. The layout of the file is not really amenable to processing on the command line so we won’t inspect it any further.

Next, display the contents of the summary.txt file:

bash
cat tutorial/fastqc/good_fastqc/summary.txt
## PASS Basic Statistics    good.fastq.gz
## PASS Per base sequence quality   good.fastq.gz
## PASS Per tile sequence quality   good.fastq.gz
## PASS Per sequence quality scores good.fastq.gz
## PASS Per base sequence content   good.fastq.gz
## PASS Per sequence GC content good.fastq.gz
## PASS Per base N content  good.fastq.gz
## PASS Sequence Length Distribution    good.fastq.gz
## PASS Sequence Duplication Levels good.fastq.gz
## PASS Overrepresented sequences   good.fastq.gz
## PASS Adapter Content good.fastq.gz

Obviously, this file contains a summary of the results for each test module. The first column lists the result, the second is the test module and the third is the input file. For comparison, lets now run FastQC on the ‘bad’ file included in the example data directory.

Run FastQC to generate a quality control report for the ‘bad’ data:

bash
fastqc -o tutorial/fastqc tutorial/data/bad.fastq.gz
## Started analysis of bad.fastq.gz
## Approx 5% complete for bad.fastq.gz
## Approx 10% complete for bad.fastq.gz
## Approx 15% complete for bad.fastq.gz
## Approx 20% complete for bad.fastq.gz
## Approx 25% complete for bad.fastq.gz
## Approx 30% complete for bad.fastq.gz
## Approx 35% complete for bad.fastq.gz
## Approx 40% complete for bad.fastq.gz
## Approx 45% complete for bad.fastq.gz
## Approx 50% complete for bad.fastq.gz
## Approx 55% complete for bad.fastq.gz
## Approx 60% complete for bad.fastq.gz
## Approx 65% complete for bad.fastq.gz
## Approx 70% complete for bad.fastq.gz
## Approx 75% complete for bad.fastq.gz
## Approx 80% complete for bad.fastq.gz
## Approx 85% complete for bad.fastq.gz
## Approx 90% complete for bad.fastq.gz
## Approx 95% complete for bad.fastq.gz
## Approx 100% complete for bad.fastq.gz
## Analysis complete for bad.fastq.gz

Inspect the quality control report:

Output

As expected, this data has a lot of problems:

  • The base sequence quality severely decreases toward the end of the reads
  • The tile sequence quality exhibits spots of poor read quality
  • The per base sequence content shows a bias toward thymine
  • The per sequence GC content does not match the theoretical distribution
  • There are lots of over-represented sequences with no identifiable origin
  • There appears to be some adapter contamination from the Illumina Paired End PCR Primer 2

So what can we do about this? It would be impractical to re-do the sequencing again and again until we end up with ‘good’ quality data. Instead, we can process or ‘clean’ the data using read trimming software.

3.3 Read trimming

Read trimming is used to ‘clean’ reads which exhibit adapter contamination or low-quality bases. There is an endless number of read trimming programs available, but all of them basically do the same thing. How fast and how many options they have is usually what differentiates them. We are going to use Cutadapt mainly because of the extensive documentation. Do not underestimate how important and rare well organised and detailed documentation is in the world of bioinformatics software.

In the example data directory is a ‘contaminated’ FASTQ file. We are going to use this file to demonstrate the effect read trimming has on the quality of the data, as measured by the FastQC report.

Run FastQC to generate a quality control report before trimming:

bash
fastqc -o tutorial/fastqc tutorial/data/contam.fastq.gz
## Started analysis of contam.fastq.gz
## Approx 5% complete for contam.fastq.gz
## Approx 10% complete for contam.fastq.gz
## Approx 15% complete for contam.fastq.gz
## Approx 20% complete for contam.fastq.gz
## Approx 25% complete for contam.fastq.gz
## Approx 30% complete for contam.fastq.gz
## Approx 35% complete for contam.fastq.gz
## Approx 40% complete for contam.fastq.gz
## Approx 45% complete for contam.fastq.gz
## Approx 50% complete for contam.fastq.gz
## Approx 55% complete for contam.fastq.gz
## Approx 60% complete for contam.fastq.gz
## Approx 65% complete for contam.fastq.gz
## Approx 70% complete for contam.fastq.gz
## Approx 75% complete for contam.fastq.gz
## Approx 80% complete for contam.fastq.gz
## Approx 85% complete for contam.fastq.gz
## Approx 90% complete for contam.fastq.gz
## Approx 95% complete for contam.fastq.gz
## Approx 100% complete for contam.fastq.gz
## Analysis complete for contam.fastq.gz

Inspect the quality control report:

Output

The un-trimmed data has the following issues:

  • The per base sequence quality degrades massively
  • The per sequence quality scores are quite variable
  • The per sequence GC content does not match the theoretical distribution
  • At least 10% of the reads have adapter contamination (Illumina Universal Adapter)

We can use Cutadapt to address these issues by performing read quality and adapter trimming. Cutadapt has a number of options and it would be worth you reading through these at some point to get a sense of what problems can and can’t be addressed.

Activate the cutadapt environment:

bash
conda activate cutadapt

Print the help information for the cutadapt command:

bash
cutadapt --help
## cutadapt version 4.1
## 
## Copyright (C) 2010-2022 Marcel Martin <marcel.martin@scilifelab.se>
## 
## cutadapt removes adapter sequences from high-throughput sequencing reads.
## 
## Usage:
##     cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq
## 
## For paired-end reads:
##     cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq
## 
## Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard
## characters are supported. All reads from input.fastq will be written to
## output.fastq with the adapter sequence removed. Adapter matching is
## error-tolerant. Multiple adapter sequences can be given (use further -a
## options), but only the best-matching adapter will be removed.
## 
## Input may also be in FASTA format. Compressed input and output is supported and
## auto-detected from the file name (.gz, .xz, .bz2). Use the file name '-' for
## standard input/output. Without the -o option, output is sent to standard output.
## 
## Citation:
## 
## Marcel Martin. Cutadapt removes adapter sequences from high-throughput
## sequencing reads. EMBnet.Journal, 17(1):10-12, May 2011.
## http://dx.doi.org/10.14806/ej.17.1.200
## 
## Run "cutadapt --help" to see all command-line options.
## See https://cutadapt.readthedocs.io/ for full documentation.
## 
## Options:
##   -h, --help            Show this help message and exit
##   --version             Show version number and exit
##   --debug               Print debug log. Use twice to also print DP matrices
##   -j CORES, --cores CORES
##                         Number of CPU cores to use. Use 0 to auto-detect.
##                         Default: 1
## 
## Finding adapters:
##   Parameters -a, -g, -b specify adapters to be removed from each read (or from
##   R1 if data is paired-end. If specified multiple times, only the best
##   matching adapter is trimmed (but see the --times option). Use notation
##   'file:FILE' to read adapter sequences from a FASTA file.
## 
##   -a ADAPTER, --adapter ADAPTER
##                         Sequence of an adapter ligated to the 3' end (paired
##                         data: of the first read). The adapter and subsequent
##                         bases are trimmed. If a '$' character is appended
##                         ('anchoring'), the adapter is only found if it is a
##                         suffix of the read.
##   -g ADAPTER, --front ADAPTER
##                         Sequence of an adapter ligated to the 5' end (paired
##                         data: of the first read). The adapter and any preceding
##                         bases are trimmed. Partial matches at the 5' end are
##                         allowed. If a '^' character is prepended ('anchoring'),
##                         the adapter is only found if it is a prefix of the read.
##   -b ADAPTER, --anywhere ADAPTER
##                         Sequence of an adapter that may be ligated to the 5' or
##                         3' end (paired data: of the first read). Both types of
##                         matches as described under -a and -g are allowed. If the
##                         first base of the read is part of the match, the
##                         behavior is as with -g, otherwise as with -a. This
##                         option is mostly for rescuing failed library
##                         preparations - do not use if you know which end your
##                         adapter was ligated to!
##   -e E, --error-rate E, --errors E
##                         Maximum allowed error rate (if 0 <= E < 1), or absolute
##                         number of errors for full-length adapter match (if E is
##                         an integer >= 1). Error rate = no. of errors divided by
##                         length of matching region. Default: 0.1 (10%)
##   --no-indels           Allow only mismatches in alignments. Default: allow both
##                         mismatches and indels
##   -n COUNT, --times COUNT
##                         Remove up to COUNT adapters from each read. Default: 1
##   -O MINLENGTH, --overlap MINLENGTH
##                         Require MINLENGTH overlap between read and adapter for
##                         an adapter to be found. Default: 3
##   --match-read-wildcards
##                         Interpret IUPAC wildcards in reads. Default: False
##   -N, --no-match-adapter-wildcards
##                         Do not interpret IUPAC wildcards in adapters.
##   --action {trim,retain,mask,lowercase,none}
##                         What to do if a match was found. trim: trim adapter and
##                         up- or downstream sequence; retain: trim, but retain
##                         adapter; mask: replace with 'N' characters; lowercase:
##                         convert to lowercase; none: leave unchanged. Default:
##                         trim
##   --rc, --revcomp       Check both the read and its reverse complement for
##                         adapter matches. If match is on reverse-complemented
##                         version, output that one. Default: check only read
## 
## Additional read modifications:
##   -u LENGTH, --cut LENGTH
##                         Remove bases from each read (first read only if paired).
##                         If LENGTH is positive, remove bases from the beginning.
##                         If LENGTH is negative, remove bases from the end. Can be
##                         used twice if LENGTHs have different signs. This is
##                         applied *before* adapter trimming.
##   --nextseq-trim 3'CUTOFF
##                         NextSeq-specific quality trimming (each read). Trims
##                         also dark cycles appearing as high-quality G bases.
##   -q [5'CUTOFF,]3'CUTOFF, --quality-cutoff [5'CUTOFF,]3'CUTOFF
##                         Trim low-quality bases from 5' and/or 3' ends of each
##                         read before adapter removal. Applied to both reads if
##                         data is paired. If one value is given, only the 3' end
##                         is trimmed. If two comma-separated cutoffs are given,
##                         the 5' end is trimmed with the first cutoff, the 3' end
##                         with the second.
##   --quality-base N      Assume that quality values in FASTQ are encoded as
##                         ascii(quality + N). This needs to be set to 64 for some
##                         old Illumina FASTQ files. Default: 33
##   --length LENGTH, -l LENGTH
##                         Shorten reads to LENGTH. Positive values remove bases at
##                         the end while negative ones remove bases at the
##                         beginning. This and the following modifications are
##                         applied after adapter trimming.
##   --trim-n              Trim N's on ends of reads.
##   --length-tag TAG      Search for TAG followed by a decimal number in the
##                         description field of the read. Replace the decimal
##                         number with the correct length of the trimmed read. For
##                         example, use --length-tag 'length=' to correct fields
##                         like 'length=123'.
##   --strip-suffix STRIP_SUFFIX
##                         Remove this suffix from read names if present. Can be
##                         given multiple times.
##   -x PREFIX, --prefix PREFIX
##                         Add this prefix to read names. Use {name} to insert the
##                         name of the matching adapter.
##   -y SUFFIX, --suffix SUFFIX
##                         Add this suffix to read names; can also include {name}
##   --rename TEMPLATE     Rename reads using TEMPLATE containing variables such as
##                         {id}, {adapter_name} etc. (see documentation)
##   --zero-cap, -z        Change negative quality values to zero.
## 
## Filtering of processed reads:
##   Filters are applied after above read modifications. Paired-end reads are
##   always discarded pairwise (see also --pair-filter).
## 
##   -m LEN[:LEN2], --minimum-length LEN[:LEN2]
##                         Discard reads shorter than LEN. Default: 0
##   -M LEN[:LEN2], --maximum-length LEN[:LEN2]
##                         Discard reads longer than LEN. Default: no limit
##   --max-n COUNT         Discard reads with more than COUNT 'N' bases. If COUNT
##                         is a number between 0 and 1, it is interpreted as a
##                         fraction of the read length.
##   --max-expected-errors ERRORS, --max-ee ERRORS
##                         Discard reads whose expected number of errors (computed
##                         from quality values) exceeds ERRORS.
##   --discard-trimmed, --discard
##                         Discard reads that contain an adapter. Use also -O to
##                         avoid discarding too many randomly matching reads.
##   --discard-untrimmed, --trimmed-only
##                         Discard reads that do not contain an adapter.
##   --discard-casava      Discard reads that did not pass CASAVA filtering (header
##                         has :Y:).
## 
## Output:
##   --quiet               Print only error messages.
##   --report {full,minimal}
##                         Which type of report to print: 'full' or 'minimal'.
##                         Default: full
##   --json FILE           Dump report in JSON format to FILE
##   -o FILE, --output FILE
##                         Write trimmed reads to FILE. FASTQ or FASTA format is
##                         chosen depending on input. Summary report is sent to
##                         standard output. Use '{name}' for demultiplexing (see
##                         docs). Default: write to standard output
##   --fasta               Output FASTA to standard output even on FASTQ input.
##   -Z                    Use compression level 1 for gzipped output files
##                         (faster, but uses more space)
##   --info-file FILE      Write information about each read and its adapter
##                         matches into FILE. See the documentation for the file
##                         format.
##   -r FILE, --rest-file FILE
##                         When the adapter matches in the middle of a read, write
##                         the rest (after the adapter) to FILE.
##   --wildcard-file FILE  When the adapter has N wildcard bases, write adapter
##                         bases matching wildcard positions to FILE. (Inaccurate
##                         with indels.)
##   --too-short-output FILE
##                         Write reads that are too short (according to length
##                         specified by -m) to FILE. Default: discard reads
##   --too-long-output FILE
##                         Write reads that are too long (according to length
##                         specified by -M) to FILE. Default: discard reads
##   --untrimmed-output FILE
##                         Write reads that do not contain any adapter to FILE.
##                         Default: output to same file as trimmed reads
## 
## Paired-end options:
##   The -A/-G/-B/-U/-Q options work like their lowercase counterparts, but are
##   applied to R2 (second read in pair)
## 
##   -A ADAPTER            3' adapter to be removed from R2
##   -G ADAPTER            5' adapter to be removed from R2
##   -B ADAPTER            5'/3 adapter to be removed from R2
##   -U LENGTH             Remove LENGTH bases from R2
##   -Q [5'CUTOFF,]3'CUTOFF
##                         Quality-trimming cutoff for R2. Default: same as for R1
##   -p FILE, --paired-output FILE
##                         Write R2 to FILE.
##   --pair-adapters       Treat adapters given with -a/-A etc. as pairs. Either
##                         both or none are removed from each read pair.
##   --pair-filter {any,both,first}
##                         Which of the reads in a paired-end read have to match
##                         the filtering criterion in order for the pair to be
##                         filtered. Default: any
##   --interleaved         Read and/or write interleaved paired-end reads.
##   --untrimmed-paired-output FILE
##                         Write second read in a pair to this FILE when no adapter
##                         was found. Use with --untrimmed-output. Default: output
##                         to same file as trimmed reads
##   --too-short-paired-output FILE
##                         Write second read in a pair to this file if pair is too
##                         short.
##   --too-long-paired-output FILE
##                         Write second read in a pair to this file if pair is too
##                         long.

Create a directory for the cutadapt output files:

bash
mkdir tutorial/cutadapt

Before we perform read trimming, we need to get the sequence of the contaminating adapter. You can either ask the person who prepared the libraries for the adapter sequence they used or simply search online for the corresponding sequence. Helpfully, Illumina maintains an adapter sequences document which lists all of the adapters used in their library preparation kits.

Lets now run Cutadapt to trim the adapter sequence and low quality bases:

bash
# -q 28             | Trim low-quality bases from 3' end of each read
# -a AGATCGGAAGAGC  | Illumina adapter sequence ligated to the 3' end
# -m 26             | Discard reads shorter than 26 bases
cutadapt -q 28 -a AGATCGGAAGAGC -m 36 -o tutorial/cutadapt/trimmed.fastq.gz tutorial/data/contam.fastq.gz
## This is cutadapt 4.1 with Python 3.10.6
## Command line parameters: -q 28 -a AGATCGGAAGAGC -m 36 -o tutorial/cutadapt/trimmed.fastq.gz tutorial/data/contam.fastq.gz
## Processing single-end reads on 1 core ...
## Finished in 3.61 s (14 µs/read; 4.16 M reads/minute).
## 
## === Summary ===
## 
## Total reads processed:                 250,000
## Reads with adapters:                    23,895 (9.6%)
## 
## == Read fate breakdown ==
## Reads that were too short:              11,389 (4.6%)
## Reads written (passing filters):       238,611 (95.4%)
## 
## Total basepairs processed:    62,750,000 bp
## Quality-trimmed:              24,964,552 bp (39.8%)
## Total written (filtered):     36,658,201 bp (58.4%)
## 
## === Adapter 1 ===
## 
## Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 23895 times
## 
## Minimum overlap: 3
## No. of allowed errors:
## 1-9 bp: 0; 10-13 bp: 1
## 
## Bases preceding removed adapters:
##   A: 26.5%
##   C: 27.0%
##   G: 22.8%
##   T: 23.6%
##   none/other: 0.1%
## 
## Overview of removed sequences
## length   count   expect  max.err error counts
## 3    2209    3906.2  0   2209
## 4    397 976.6   0   397
## 5    397 244.1   0   397
## 6    295 61.0    0   295
## 7    276 15.3    0   276
## 8    114 3.8 0   114
## 9    78  1.0 0   70 8
## 10   66  0.2 1   57 9
## 11   37  0.1 1   37
## 12   41  0.0 1   36 5
## 13   53  0.0 1   52 1
## 14   77  0.0 1   68 9
## 15   110 0.0 1   102 8
## 16   137 0.0 1   116 21
## 17   714 0.0 1   696 18
## 18   829 0.0 1   815 14
## 19   65  0.0 1   63 2
## 20   96  0.0 1   94 2
## 21   56  0.0 1   54 2
## 22   108 0.0 1   104 4
## 23   48  0.0 1   45 3
## 24   42  0.0 1   37 5
## 25   176 0.0 1   167 9
## 26   75  0.0 1   74 1
## 27   90  0.0 1   81 9
## 28   405 0.0 1   397 8
## 29   285 0.0 1   264 21
## 30   548 0.0 1   540 8
## 31   51  0.0 1   41 10
## 32   103 0.0 1   95 8
## 33   115 0.0 1   103 12
## 34   820 0.0 1   790 30
## 35   1033    0.0 1   1020 13
## 36   133 0.0 1   129 4
## 37   57  0.0 1   57
## 38   28  0.0 1   20 8
## 39   142 0.0 1   139 3
## 40   297 0.0 1   291 6
## 41   69  0.0 1   58 11
## 42   594 0.0 1   570 24
## 43   1474    0.0 1   1457 17
## 44   200 0.0 1   189 11
## 45   216 0.0 1   211 5
## 46   74  0.0 1   65 9
## 47   222 0.0 1   208 14
## 48   664 0.0 1   651 13
## 49   134 0.0 1   128 6
## 50   69  0.0 1   65 4
## 51   45  0.0 1   43 2
## 52   42  0.0 1   41 1
## 53   89  0.0 1   81 8
## 54   525 0.0 1   500 25
## 55   761 0.0 1   756 5
## 56   38  0.0 1   30 8
## 57   118 0.0 1   117 1
## 58   64  0.0 1   62 2
## 59   17  0.0 1   16 1
## 60   123 0.0 1   123
## 61   149 0.0 1   149
## 62   74  0.0 1   66 8
## 63   535 0.0 1   527 8
## 64   341 0.0 1   331 10
## 65   524 0.0 1   497 27
## 66   1652    0.0 1   1627 25
## 67   426 0.0 1   422 4
## 68   256 0.0 1   253 3
## 69   361 0.0 1   349 12
## 70   493 0.0 1   476 17
## 71   740 0.0 1   711 29
## 72   816 0.0 1   796 20
## 73   626 0.0 1   617 9
## 74   426 0.0 1   420 6
## 75   209 0.0 1   205 4
## 76   72  0.0 1   70 2
## 77   13  0.0 1   13
## 78   4   0.0 1   3 1
## 80   1   0.0 1   1
## 82   1   0.0 1   1
## 85   1   0.0 1   0 1
## 86   1   0.0 1   0 1
## 87   2   0.0 1   2
## 89   1   0.0 1   1
## 90   2   0.0 1   2
## 92   2   0.0 1   2
## 93   2   0.0 1   2
## 102  1   0.0 1   1
## 103  1   0.0 1   0 1
## 104  2   0.0 1   2
## 105  4   0.0 1   4
## 106  1   0.0 1   1
## 111  1   0.0 1   1
## 120  1   0.0 1   1
## 123  1   0.0 1   0 1
## 125  1   0.0 1   0 1
## 126  1   0.0 1   0 1
## 136  1   0.0 1   1
## 138  1   0.0 1   0 1
## 139  1   0.0 1   0 1
## 148  1   0.0 1   0 1
## 162  1   0.0 1   0 1
## 164  1   0.0 1   0 1
## 169  1   0.0 1   0 1
## 179  1   0.0 1   0 1
## 188  1   0.0 1   0 1
## 198  1   0.0 1   0 1

Activate the fastqc environment:

bash
conda activate fastqc

Run FastQC to generate a quality control report after trimming:

bash
fastqc -o tutorial/fastqc tutorial/cutadapt/trimmed.fastq.gz
## Started analysis of trimmed.fastq.gz
## Approx 5% complete for trimmed.fastq.gz
## Approx 10% complete for trimmed.fastq.gz
## Approx 15% complete for trimmed.fastq.gz
## Approx 20% complete for trimmed.fastq.gz
## Approx 25% complete for trimmed.fastq.gz
## Approx 30% complete for trimmed.fastq.gz
## Approx 35% complete for trimmed.fastq.gz
## Approx 40% complete for trimmed.fastq.gz
## Approx 45% complete for trimmed.fastq.gz
## Approx 50% complete for trimmed.fastq.gz
## Approx 55% complete for trimmed.fastq.gz
## Approx 60% complete for trimmed.fastq.gz
## Approx 65% complete for trimmed.fastq.gz
## Approx 70% complete for trimmed.fastq.gz
## Approx 75% complete for trimmed.fastq.gz
## Approx 80% complete for trimmed.fastq.gz
## Approx 85% complete for trimmed.fastq.gz
## Approx 90% complete for trimmed.fastq.gz
## Approx 95% complete for trimmed.fastq.gz
## Analysis complete for trimmed.fastq.gz

Inspect the trimmed quality control report:

Output

You should see that FastQC no longer reports any issues with the per base sequence quality or adapter contamination. However, we still receive a warning for the per sequence GC content. Is this a problem? Remember what I said previously about blindly following the FastQC test results. Does it matter that the GC content is different to this ‘theoretical’ distribution? What if the data we generated was from an assay that sampled just high GC regions on the genome? An understanding of the data generating process is crucial to assess the FastQC test results in their proper context. Having said that, it appears we have successfully ‘cleaned’ this data and can proceed with our analysis.

4 Exercises

The exercises below are designed to strengthen your knowledge of using FastQC and Cutadapt to perform quality control on sequencing data. The solution to each problem is blurred, only after attempting to solve the problem yourself should you look at the solution. Should you need any help, please ask one of the instructors.

Create a directory to store the output files from each exercise:

bash
mkdir exercises
mkdir exercises/chipseq
mkdir exercises/rnaseq
mkdir exercises/atacseq

4.1 ChIP sequencing

ChIP sequencing, also known as ChIP-seq, is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. It can be used to map global binding sites precisely for any protein of interest.

In the data directory is a file called chipseq.fastq.gz which contains sequencing data from a ChIP-seq experiment. Use FastQC and Cutadapt to quality check and clean the data if necessary. Pay close attention to the ‘Overrepresented sequences’ module. Can you clean the data so it passes this module?

bash
# Run FastQC on the raw data
conda activate fastqc
fastqc -o exercises/chipseq tutorial/data/chipseq.fastq.gz

# Trim the Illumina TruSeq adapter sequence
# Remove over represented sequences containing any N base calls
conda activate cutadapt
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --max-n 1 -o exercises/chipseq/chipseq_trimmed.fastq.gz tutorial/data/chipseq.fastq.gz

# Run FastQC on the trimmed data
conda activate fastqc
fastqc -o exercises/chipseq exercises/chipseq/chipseq_trimmed.fastq.gz
## Started analysis of chipseq.fastq.gz
## Approx 5% complete for chipseq.fastq.gz
## Approx 10% complete for chipseq.fastq.gz
## Approx 15% complete for chipseq.fastq.gz
## Approx 20% complete for chipseq.fastq.gz
## Approx 25% complete for chipseq.fastq.gz
## Approx 30% complete for chipseq.fastq.gz
## Approx 35% complete for chipseq.fastq.gz
## Approx 40% complete for chipseq.fastq.gz
## Approx 45% complete for chipseq.fastq.gz
## Approx 50% complete for chipseq.fastq.gz
## Approx 55% complete for chipseq.fastq.gz
## Approx 60% complete for chipseq.fastq.gz
## Approx 65% complete for chipseq.fastq.gz
## Approx 70% complete for chipseq.fastq.gz
## Approx 75% complete for chipseq.fastq.gz
## Approx 80% complete for chipseq.fastq.gz
## Approx 85% complete for chipseq.fastq.gz
## Approx 90% complete for chipseq.fastq.gz
## Approx 95% complete for chipseq.fastq.gz
## Approx 100% complete for chipseq.fastq.gz
## Analysis complete for chipseq.fastq.gz
## This is cutadapt 4.1 with Python 3.10.6
## Command line parameters: -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --max-n 1 -o exercises/chipseq/chipseq_trimmed.fastq.gz tutorial/data/chipseq.fastq.gz
## Processing single-end reads on 1 core ...
## Finished in 0.69 s (7 µs/read; 8.69 M reads/minute).
## 
## === Summary ===
## 
## Total reads processed:                 100,000
## Reads with adapters:                     1,670 (1.7%)
## 
## == Read fate breakdown ==
## Reads with too many N:                     640 (0.6%)
## Reads written (passing filters):        99,360 (99.4%)
## 
## Total basepairs processed:     3,751,129 bp
## Total written (filtered):      3,707,971 bp (98.8%)
## 
## === Adapter 1 ===
## 
## Sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA; Type: regular 3'; Length: 33; Trimmed: 1670 times
## 
## Minimum overlap: 3
## No. of allowed errors:
## 1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-33 bp: 3
## 
## Bases preceding removed adapters:
##   A: 25.4%
##   C: 15.5%
##   G: 20.4%
##   T: 12.3%
##   none/other: 26.3%
## 
## Overview of removed sequences
## length   count   expect  max.err error counts
## 3    1041    1562.5  0   1041
## 4    150 390.6   0   150
## 5    32  97.7    0   32
## 6    2   24.4    0   2
## 9    1   0.4 0   0 1
## 10   1   0.1 1   0 1
## 37   5   0.0 3   0 1 4
## 38   438 0.0 3   0 416 20 2
## Started analysis of chipseq_trimmed.fastq.gz
## Approx 5% complete for chipseq_trimmed.fastq.gz
## Approx 10% complete for chipseq_trimmed.fastq.gz
## Approx 15% complete for chipseq_trimmed.fastq.gz
## Approx 20% complete for chipseq_trimmed.fastq.gz
## Approx 25% complete for chipseq_trimmed.fastq.gz
## Approx 30% complete for chipseq_trimmed.fastq.gz
## Approx 35% complete for chipseq_trimmed.fastq.gz
## Approx 40% complete for chipseq_trimmed.fastq.gz
## Approx 45% complete for chipseq_trimmed.fastq.gz
## Approx 50% complete for chipseq_trimmed.fastq.gz
## Approx 55% complete for chipseq_trimmed.fastq.gz
## Approx 60% complete for chipseq_trimmed.fastq.gz
## Approx 65% complete for chipseq_trimmed.fastq.gz
## Approx 70% complete for chipseq_trimmed.fastq.gz
## Approx 75% complete for chipseq_trimmed.fastq.gz
## Approx 80% complete for chipseq_trimmed.fastq.gz
## Approx 85% complete for chipseq_trimmed.fastq.gz
## Approx 90% complete for chipseq_trimmed.fastq.gz
## Approx 95% complete for chipseq_trimmed.fastq.gz
## Analysis complete for chipseq_trimmed.fastq.gz

4.2 RNA sequencing

RNA-Seq (named as an abbreviation of RNA sequencing) is a sequencing technique which uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a given moment, analyzing the continuously changing cellular transcriptome.

In the data directory is a file called `rnaseq.fastq.gz`` which contains sequencing data from an RNA-seq experiment. Use FastQC and Cutadapt to quality check and clean the data if necessary. Pay close attention to the ‘Per sequence GC content’ and ‘Overrepresented sequences’ modules. Can you clean the data so both of these modules pass? Should you?

bash
# This is an RNA-seq library.
# Depending on the cell type and condition, the GC content of expressed genes might be very skewed.
# The assumptions of the theoretical distribution do not hold for this particular type of data.
# Additionally, it is not clear whether the over-represented sequences are from contamination.
# In an RNA-seq library we expect to see thousands of copies of the same DNA fragment, because a particular gene might be highly expressed.
# The over-represented sequences you observe actually come from a highly expressed gene in the library, not contamination.

4.3 ATAC sequencing

ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) is a technique used in molecular biology to assess genome-wide chromatin accessibility. ATAC-seq identifies accessible DNA regions by probing open chromatin with hyperactive mutant Tn5 Transposase that inserts sequencing adapters into open regions of the genome.

In the data directory is a file called atacseq.fastq.gz which contains sequencing data from an ATAC-seq experiment. Use FastQC and Cutadapt to quality check and clean the data if necessary. Pay close attention to the ‘Per base sequence content’ and ‘Overrepresented sequences’ modules. Can you clean the data so both of these modules pass? If not, think about why?

bash
# Run FastQC on the raw data
conda activate fastqc
fastqc -o exercises/atacseq tutorial/data/atacseq.fastq.gz

# Trim the Illumina TruSeq adapter sequence
# Remove over represented sequences containing N base calls
conda activate cutadapt
cutadapt --max-n 1 -o exercises/atacseq/atacseq_trimmed.fastq.gz tutorial/data/atacseq.fastq.gz

# Run FastQC on the trimmed data
conda activate fastqc
fastqc -o exercises/atacseq exercises/atacseq/chipseq_trimmed.fastq.gz

# The 'Per base sequence content' module fails because of the strange pattern at the beginning of the read.
# This is not due to contamination or a probably with the library.
# The ATAC-seq protocol uses a Tn5 transposase which inserts itself into regions of open DNA.
# Those regions of open DNA are what go on to be sequenced.
# The strange pattern you observe is due to a bias in the Tn5 transposase itself - it has a tendency to insert itself into regions of the DNA with a particular sequence composition.
## Started analysis of atacseq.fastq.gz
## Approx 5% complete for atacseq.fastq.gz
## Approx 10% complete for atacseq.fastq.gz
## Approx 15% complete for atacseq.fastq.gz
## Approx 20% complete for atacseq.fastq.gz
## Approx 25% complete for atacseq.fastq.gz
## Approx 30% complete for atacseq.fastq.gz
## Approx 35% complete for atacseq.fastq.gz
## Approx 40% complete for atacseq.fastq.gz
## Approx 45% complete for atacseq.fastq.gz
## Approx 50% complete for atacseq.fastq.gz
## Approx 55% complete for atacseq.fastq.gz
## Approx 60% complete for atacseq.fastq.gz
## Approx 65% complete for atacseq.fastq.gz
## Approx 70% complete for atacseq.fastq.gz
## Approx 75% complete for atacseq.fastq.gz
## Approx 80% complete for atacseq.fastq.gz
## Approx 85% complete for atacseq.fastq.gz
## Approx 90% complete for atacseq.fastq.gz
## Approx 95% complete for atacseq.fastq.gz
## Approx 100% complete for atacseq.fastq.gz
## Analysis complete for atacseq.fastq.gz
## This is cutadapt 4.1 with Python 3.10.6
## Command line parameters: --max-n 1 -o exercises/atacseq/atacseq_trimmed.fastq.gz tutorial/data/atacseq.fastq.gz
## Processing single-end reads on 1 core ...
## Finished in 0.81 s (8 µs/read; 7.39 M reads/minute).
## 
## === Summary ===
## 
## Total reads processed:                 100,000
## 
## == Read fate breakdown ==
## Reads with too many N:                   1,784 (1.8%)
## Reads written (passing filters):        98,216 (98.2%)
## 
## Total basepairs processed:     7,117,244 bp
## Total written (filtered):      7,053,624 bp (99.1%)
## Skipping 'exercises/atacseq/chipseq_trimmed.fastq.gz' which didn't exist, or couldn't be read